DSNS (Dynamically-hazard-resolved, Statically-code-scheduled, Nonuniform Superscalar) 

Branch Pipeline of the DSNS Processor Prototype 
(in Japanese) 

Tetsuya HARA, Morihiro KUGA, Kazuaki MURAKAMI, and Shinji TOMITA 
Department of Information Systems 
Interdisciplinary Graduate School of Engineering Sciences 
Kyushu University 
6-1 Kasuga-koen, Kasuga-shi, Fukuoka, 816 Japan 
e-mail : hara@is.kyushu-u.ac.jp 

A DSNS (Dynamically-hazard-resolved, Statically-code-scheduled, Nonuniform Superscalar) 
processor prototype is being built at Kyushu University. 

Control hazards due to branches cause a severe performance loss for superscalar 
processors. The DSNS processor prototype alleviates these effects with ① static branch prediction 
with a branch-target buffer, ② speculative execution, ③ advanced conditioning, and ④ early branch 
resolution. 

This paper presents the branch architecture and the branch pipeline of the DSNS processor 
prototype. ... 
Branch Pipeline of DSNS Superscalar Processor Prototype 

Tetsuya HARA, Morihiro KUGA, Kazuaki MURAKAMI and Shinji TOMITA 
(Department of Information Systems, Interdisciplinary Graduate School of 
Engineering Sciences, Kyushu University) 

At present, we are developing a superscalar processor based upon the DSNS 
(Dynamically-hazard-resolved, Statically-code-scheduled, Nonuniform Superscalar) 
architecture. In a superscalar processor, branch instructions and the control 
dependences that result from them cause branch penalties, such as interference 
in instruction fetch, interference in the execution of successor instructions, 
disarray of the pipeline due to branch delay, and a reduction in 
instruction-level parallelism. These result in a marked reduction in 
performance. In order to alleviate the influence of branch penalties, the DSNS 
processor adopts the following branch architecture techniques: ① static branch 
prediction with a branch-target buffer, ② speculative instruction execution, 
③ advanced conditioning and ④ early branch resolution. 

In the present paper, the branch architecture and the pipeline processing of 
branch instructions are described. 

1. Introduction 

We are developing a processor that adopts 
a superscalar architecture in order to 
exploit instruction-level parallelism. 
This processor has the following 
characteristics [3]: 

(1) Dynamic hazard resolution: 

Hazards that arise from dependences 
between instructions and from resource 
conflicts are detected and resolved by 
hardware. This guarantees object-code 
compatibility. 

(2) Static code scheduling: 

If resource conflicts and/or dependences 
between instructions exist, the number of 
instructions that can be executed 
simultaneously decreases, which reduces 
instruction-level parallelism. To keep 
the pipeline flowing without stalls, it 
is necessary to rearrange the issue order 
of instructions (code scheduling) so that 
more instructions can be executed 
simultaneously. This code scheduling is 
performed statically at compile time to 
increase parallelism. At the same time, 
the growth in hardware that dynamic code 
scheduling would entail is avoided. 

(3) Heterogeneous (nonuniform) functional units: 

The instruction fetch mechanism, the 
decoding mechanism, the register ports 
and paths, etc. are replicated according 
to the degree of superscalar execution. 
Among the functional units, however, only 
those that are frequently used are 
replicated, so cost performance is 
excellent. 

Based upon the above-mentioned 
characteristics, the processor currently 
being developed is referred to as a DSNS 
(Dynamically-hazard-resolved, 
Statically-code-scheduled, Nonuniform 
Superscalar) processor. 

In an instruction-pipelined processor, 
performance decreases due to branch 
instructions and the control dependences 
that result from them. In a superscalar 
processor, the influence of these branch 
penalties is particularly pronounced. The 
primary causes of the branch penalties 
are as follows: 

① Interference in instruction fetch: the 
   successor instruction(s) to be fetched 
   next cannot be determined until it is 
   decided whether or not the branch is 
   taken (in the case of a conditional 
   branch) and until the branch target 
   address becomes definite. 

② Interference in the execution of 
   successor instructions due to control 
   dependence: even if the successor 
   instruction to be fetched next is 
   predicted and fetched, its execution 
   cannot be started and completed. This 
   is because a precise machine state 
   cannot be guaranteed if control 
   dependent instructions (in other words, 
   instructions for which it has not yet 
   been decided whether they should be 
   executed) update the register 
   contents, etc. 

③ Disarray of the pipeline due to branch 
   delay: when the branch prediction is 
   incorrect, the pipeline has to be 
   flushed and the correct instruction 
   has to be re-fetched. Therefore, the 
   longer the execution of the branch 
   instruction is delayed, the worse the 
   disarray of the pipeline becomes. 

④ Instruction misalignment [1] [15]: 
   This is a problem only in superscalar 
   processors. When the branch target 
   address does not fall on an 
   instruction fetch boundary, the 
   fetched instruction block includes 
   unnecessary instructions, which 
   reduces instruction-level parallelism. 

To alleviate the influence of the branch 
penalties ①, ② and ③ above, the DSNS 
processor adopts the following 
techniques: 

(a) Static branch prediction with a 
branch-target-buffer; 

(b) Speculative instruction execution; 

(c) Advanced conditioning; and, 

(d) Early branch resolution. 

Instruction misalignment (④ above) can 
be handled by hardware [1]. The DSNS 
processor, however, relies entirely upon 
the compiler for it. 
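
As an illustration of instruction misalignment, the following Python 
sketch counts how many instructions of a fetched block are useful when 
a branch target falls inside the block. The block of four 4-byte 
instructions is taken from Section 4.1; the function name useful_slots 
is hypothetical and is not part of the DSNS design. 

```python
INSTR_BYTES = 4    # instruction length in bytes (Section 3.5)
BLOCK_INSTRS = 4   # instructions fetched per block (Section 4.1)
BLOCK_BYTES = INSTR_BYTES * BLOCK_INSTRS


def useful_slots(target_addr: int) -> int:
    """Return how many instructions in the fetched block lie at or after the target."""
    if target_addr % INSTR_BYTES != 0:
        raise ValueError("instruction addresses must be 4-byte aligned")
    # Position of the target instruction within its 16-byte block (0..3).
    slot = (target_addr % BLOCK_BYTES) // INSTR_BYTES
    # Instructions before the target in the same block are fetched but unused.
    return BLOCK_INSTRS - slot


if __name__ == "__main__":
    for addr in (0x100, 0x104, 0x108, 0x10C):
        print(hex(addr), useful_slots(addr))
```

A target on a block boundary keeps all four slots useful, whereas a 
target in the last slot leaves only one. This is why a compiler that 
aligns branch targets can recover the lost parallelism. 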

In the present paper, we first survey 
general countermeasures against branch 
penalties (Chapter 2), and then describe 
the branch architecture adopted in the 
DSNS processor (Chapter 3), the outline 
of the DSNS processor (Chapter 4) and the 
pipeline processing of branch 
instructions (Chapter 5). 

2. General Action on Branch 
Penalties: 

2.1 Managing Branch Penalties: 

Many methods have been proposed for 
reducing the influence of the branch 
penalties caused by branch instructions, 
as listed below. The methods in (1) 
through (4) below are substantially 
orthogonal to one another, and can be 
combined in various ways in an 
implementation. 

(1) Branch method: 

First, the following methods differ from 
the normal compare-and-branch method and 
branch-on-condition method: 

① Advanced conditioning [2] [12]: the 
   conditioning (deciding whether or not 
   the branch is taken) is performed in 
   advance of the branch, in an attempt to 
   reduce the branch delay that 
   accompanies the execution of the 
   branch instruction. 

② Prepare-to-branch method [9]: the 
   branch target instruction is fetched 
   in advance by a prepare-to-branch 
   instruction and is buffered. This 
   suppresses the disarray of the 
   pipeline when the branch is 'TAKEN'. 

(2) Branch instructions: 

The following functions are added to 
branch instructions: 

① Delayed branch [11]: a delay slot of 
   the branch instruction is defined such 
   that the instruction(s) within the 
   delay slot are not affected by control 
   dependence. In other words, regardless 
   of the branch result (TAKEN/ 
   NOT-TAKEN), these instructions are 
   always executed. If the delay slot can 
   be completely filled, no branch 
   penalty occurs. However, it is 
   difficult to find instructions that 
   are not affected by control 
   dependence. For this reason, a delayed 
   branch with squashing has also been 
   proposed, in which the instruction(s) 
   within the delay slot are quashed 
   depending upon the branch result [10]. 

② Static branch prediction [10]: the 
   compiler predicts the outcome of each 
   conditional branch instruction, based 
   upon semantic analysis results and 
   profile information. The prediction 
   is reflected in the OP code, which 
   prevents interference in instruction 
   fetch. 

(3) Instruction fetch: 

The methods for fetching instructions 
without interruption are as follows: 

① Dynamic branch prediction [8] [13]: 
   unlike static branch prediction, the 
   branch prediction is performed by 
   hardware at run time. This method 
   includes the following techniques: 

   (a) 'Not-taken' prediction: the branch 
       is predicted as 'not-taken', and 
       the instruction stream on the 
       non-branch-target path continues 
       to be fetched. 

   (b) 'Taken' prediction: the branch is 
       predicted as 'taken', and the 
       instruction stream at the branch 
       target is fetched. 

   (c) Branch prediction buffer (BPB): 
       the execution history of each 
       branch instruction is buffered, 
       and the branch prediction is 
       based upon this history. 

   (d) Branch target buffer (BTB) [13]: 
       in addition to the contents of 
       the BPB, the branch target 
       address or the branch target 
       instruction is buffered, so that 
       the branch target instruction can 
       be supplied immediately, without 
       address calculation, when 'taken' 
       is predicted. 

② Fetch of multiple instruction streams 
   [8] [9]: the instruction streams on 
   both the non-branch-target path and 
   the branch-target path are fetched. 

(4) Execution of branch instructions: 

The following measures are taken in the 
execution of branch instructions: 

① Early branch resolution [9]: the 
   branch instruction is executed in an 
   early stage of the instruction 
   pipeline. This attempts to shorten the 
   period during which the control 
   dependence exists and the execution 
   delay of the branch instruction. 

② Branch folding [6]: an instruction 
   pre-fetch stage is placed before the 
   normal instruction pipeline, and 
   unconditional branch instructions are 
   executed there. Unconditional branch 
   instructions therefore never enter the 
   instruction pipeline, and cause no 
   branch penalty. Conditional branch 
   instructions, however, are executed in 
   the instruction pipeline. 



2.2 Managing control dependence: 

The successor instructions of a branch 
instruction are generally control 
dependent on that branch instruction. In 
other words, even if the successor 
instructions can be fetched, whether or 
not they are to be executed is not 
determined until the control dependence 
is resolved. The processing of control 
dependent instructions within the 
instruction pipeline therefore requires 
special consideration, for which there 
are at least the following two methods: 

(1) Nonspeculative execution: 

The execution of control dependent 
instructions does not start until the 
control dependence is resolved. This is 
easy to implement. However, no 
instruction beyond the branch 
instruction can be executed. In other 
words, the instructions that can be 
executed simultaneously are limited to 
those within one basic block. Therefore, 
when the number of instructions within a 
basic block is small, instruction-level 
parallelism is reduced. 

(2) Speculative execution: 

The execution of control dependent 
instructions is started even though the 
control dependence has not been 
resolved. As long as the control 
dependence is unresolved, updating the 
register contents, etc. with the 
execution results of these instructions 
is prohibited. If resolving the control 
dependence shows that the instructions 
did not need to be executed, the 
instructions themselves and their 
execution results have to be quashed. 
The implementation is therefore somewhat 
complicated, as described below. 
However, instructions beyond the branch 
instruction become executable, and an 
improvement in instruction-level 
parallelism can be expected. 

There are two specific implementation 
methods for speculative execution, as 
follows: 

① Conditional execution mode 
   (conditional mode): instructions 
   fetched according to a branch 
   prediction are placed in the 
   conditional execution mode until the 
   control dependence arising from the 
   corresponding branch instruction is 
   resolved. Any instruction can be 
   executed in the conditional execution 
   mode. However, the register contents, 
   etc. cannot be updated with the 
   execution results. 

② Boosting [16]: the compiler determines 
   whether or not each instruction is 
   placed in the conditional execution 
   mode. Instructions in the conditional 
   execution mode are referred to as 
   'boost instructions', and are 
   designated as such by the OP code. 
   The boost instructions are placed 
   earlier in the object code, before 
   their corresponding branch 
   instructions. In other words, contrary 
   to method ① above, the instructions 
   in the conditional execution mode are 
   fetched, and their execution is 
   started, before their corresponding 
   branch instructions. Therefore, 
   depending upon the static code 
   scheduling, the benefit of 
   speculative execution can be expected 
   in line with method ① above. 

Further, the following means exist for 
preventing instructions in the 
conditional execution mode (including 
boost instructions) from updating the 
register contents, etc. before the 
control dependence is resolved: 

(a) Buffer method [14]: a buffer is 
    attached to the register file. The 
    buffer is used in one of the 
    following two ways: 

    (i) Reorder buffer [16] [14]: the 
        execution results of 
        instructions in the conditional 
        execution mode are buffered. 
        When the control dependence is 
        resolved, if the branch 
        prediction was correct, the 
        buffered contents are written to 
        the registers. A bypass 
        mechanism is required so that 
        operands can be fetched from the 
        buffer. 

    (ii) History buffer [14]: the 
        register contents are allowed to 
        be updated with the execution 
        results of instructions in the 
        conditional execution mode, and 
        on each such update the original 
        register contents are saved in 
        the history buffer. When the 
        control dependence is resolved, 
        if the branch prediction was 
        incorrect, the register contents 
        are restored from the original 
        contents saved in the history 
        buffer. 

(b) Future file [14]: two identical 
    register files are provided. One is 
    the register file defined by the 
    architecture (architectural file), 
    and the other is the file into which 
    the execution results of 
    instructions in the conditional 
    execution mode are written (future 
    file). When the control dependence 
    is resolved, if the branch 
    prediction was correct, the contents 
    of the future file are incorporated 
    into the architectural file. This 
    incorporation can be done either by 
    a transfer between the files [14] 
    [15], or by switching the register 
    access path [4], as described in 
    Section 3.2. 

(c) Checkpoint repair [7]: the register 
    contents at the point when the 
    control dependence arises 
    (checkpoint) are saved. When the 
    control dependence is resolved, if 
    the branch prediction was incorrect, 
    the register contents are restored 
    from the saved contents. 
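
To make the difference between these recovery schemes concrete, the 
history buffer of (a)(ii) can be sketched in Python as follows. This is 
a minimal model under stated assumptions: the class name 
HistoryBufferRegisters, the register count, and the single outstanding 
branch are illustrative choices, not details taken from [14]. 

```python
class HistoryBufferRegisters:
    """Toy register file with a history buffer (one unresolved branch assumed)."""

    def __init__(self, num_regs=8):
        self.regs = [0] * num_regs
        self.history = []  # (register index, original value), in write order

    def write(self, reg, value, conditional):
        if conditional:
            # Save the original contents before the speculative update.
            self.history.append((reg, self.regs[reg]))
        self.regs[reg] = value

    def resolve(self, prediction_correct):
        if not prediction_correct:
            # Undo speculative writes in reverse order to restore the old state.
            while self.history:
                reg, old = self.history.pop()
                self.regs[reg] = old
        # On a correct prediction the saved values are simply discarded.
        self.history.clear()


if __name__ == "__main__":
    rf = HistoryBufferRegisters()
    rf.write(1, 10, conditional=False)
    rf.write(1, 20, conditional=True)
    rf.resolve(prediction_correct=False)
    print(rf.regs[1])
```

In this scheme a correct prediction costs nothing at resolution time, 
while a misprediction requires walking the buffer. The reorder buffer 
has the opposite cost profile. 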

3. Branch Architecture: 

Among the countermeasures described in 
Chapter 2, the DSNS processor adopts the 
following techniques: 

① Static branch prediction with a branch 
   target buffer; 
② Speculative execution: conditional 
   execution mode and boosting; 
③ Advanced conditioning; and 
④ Early branch resolution. 

3.1 Static branch prediction with a 
branch target buffer: 

In the DSNS processor, the branch target 
buffer (BTB), which is conventionally 
used in combination with dynamic branch 
prediction, is combined with static 
branch prediction. 

Based upon the results of static branch 
prediction, the compiler assigns one of 
the following types to each branch 
instruction: 

① BTB registration type: a branch 
   instruction that is predicted as 
   'taken' and whose branch target 
   address is judged to be fixed. The 
   branch target address generated upon 
   execution is registered in the BTB. 
② BTB non-registration type: a branch 
   instruction that is predicted as 
   'not-taken', or that is predicted as 
   'taken' but whose branch target 
   address changes. The branch target 
   address is not registered in the BTB. 

When a branch target address is 
registered in the BTB, the corresponding 
branch instruction is predicted as 
'taken', and the registered branch target 
address is used for the instruction 
fetch in the next cycle. 
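
The fetch rule above can be sketched in Python as follows. This is a 
rough model only: one BTB entry per 16-byte instruction block follows 
Section 4.1, but the dictionary, the class name BranchTargetBuffer and 
its method names are assumptions made for illustration (the actual BTB 
shares a tag array with the instruction cache). 

```python
BLOCK_BYTES = 16  # one instruction block = four 4-byte instructions (Section 4.1)


class BranchTargetBuffer:
    """Toy BTB keyed by instruction-block address; a dict stands in for hardware."""

    def __init__(self):
        self.targets = {}

    def on_branch_executed(self, block_addr, registration_type, target):
        # Only branches that the compiler marked as the BTB registration
        # type have their generated target address registered.
        if registration_type:
            self.targets[block_addr] = target

    def next_fetch_address(self, block_addr):
        # A registered target means "predicted taken": fetch from the target.
        # Otherwise fetching continues with the next sequential block.
        return self.targets.get(block_addr, block_addr + BLOCK_BYTES)


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    btb.on_branch_executed(0x100, registration_type=True, target=0x400)
    print(hex(btb.next_fetch_address(0x100)))
    print(hex(btb.next_fetch_address(0x200)))
```

Note that no prediction hardware is involved: the taken/not-taken 
decision is encoded entirely by whether the compiler allowed the branch 
to be registered. 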

The configuration of the BTB is 
described in Section 4.1. The process of 
registering a branch target address in 
the BTB and the instruction fetch 
process using the BTB are described in 
Section 5.2. 

3.2 Speculative execution 

Both the conditional execution mode and 
boosting, described in Section 2.2, are 
adopted as speculative execution 
methods. The two methods are implemented 
in hardware in basically the same way, 
and the hardware mechanism for the 
conditional execution mode can also be 
used for boosting. 

To prevent instructions placed in the 
conditional execution mode (including 
boost instructions) from updating the 
register contents, etc. before the 
control dependence is resolved, the DSNS 
processor adopts a multiplexed register 
file [4]. This is one implementation of 
the future file [14]. The multiplicity 
of the multiplexed register file is the 
"level of the conditional execution mode 
+ 1". The level of the conditional 
execution mode in the DSNS processor is 
'1', so the multiplicity is '2' (refer 
to Section 3.4). Hereafter, this register 
file is referred to as the 'dual 
register file (DRF: Dual Register File)'. 

In the dual register file, each register 
is logically one entity but physically 
two entities (physical registers). At 
any given time, one of them is in the 
current state and the other is in the 
alternate state. The alternate state has 
two further states, valid and invalid. 
For convenience, the physical registers 
in the current state and in the 
alternate state are referred to as the 
'current register' and the 'alternate 
register', respectively. 

The operation of the dual register file 
is outlined as follows: 

(1) Reading from a register: 

An instruction placed in the 
unconditional execution mode 
(unconditional mode: the state in which 
the instruction is not control 
dependent) reads its source operand from 
the current register. An instruction 
placed in the conditional execution 
mode, on the other hand, obtains its 
source operand as follows: 

① When the alternate register is valid: 
   read from the alternate register. 
② When the alternate register is 
   invalid: read from the current 
   register. 
(2) Writing into a register: 

The execution result of an instruction 
completed in the unconditional execution 
mode is written into the current 
register. The execution result of an 
instruction completed in the conditional 
execution mode, on the other hand, is 
written into the alternate register, and 
the alternate register is set to the 
valid state. 

(3) Execution completion of a branch 
instruction: 

When the control dependence is resolved, 
the following state transitions occur in 
each register: 

① When the branch prediction is correct: 
   a valid alternate register enters the 
   current state, and the current 
   register that is its counterpart 
   enters the invalid state. The states 
   of all other registers do not change. 

② When the branch prediction is 
   incorrect: all alternate registers 
   enter the invalid state. The states 
   of the current registers do not 
   change. 
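
The three operations (1) through (3) can be summarized in a small Python 
model. It is a behavioral sketch only: the class name DualRegisterFile, 
the number of registers and the data layout are assumptions, and the 
model ignores timing and register ports. 

```python
class DualRegisterFile:
    """Behavioral sketch of a dual register file with one conditional level."""

    def __init__(self, num_regs=8):
        # Two physical registers per logical register.
        self.phys = [[0, 0] for _ in range(num_regs)]
        # Index (0 or 1) of the physical register in the current state.
        self.current = [0] * num_regs
        # Whether the alternate register holds a valid speculative value.
        self.alt_valid = [False] * num_regs

    def read(self, reg, conditional):
        cur = self.current[reg]
        if conditional and self.alt_valid[reg]:
            return self.phys[reg][1 - cur]  # valid alternate register
        return self.phys[reg][cur]          # current register

    def write(self, reg, value, conditional):
        cur = self.current[reg]
        if conditional:
            self.phys[reg][1 - cur] = value
            self.alt_valid[reg] = True
        else:
            self.phys[reg][cur] = value

    def resolve(self, prediction_correct):
        for reg in range(len(self.phys)):
            if prediction_correct and self.alt_valid[reg]:
                # The valid alternate register becomes the current register;
                # its counterpart becomes the (invalid) alternate register.
                self.current[reg] = 1 - self.current[reg]
            # In both cases no alternate register remains valid afterwards.
            self.alt_valid[reg] = False


if __name__ == "__main__":
    drf = DualRegisterFile()
    drf.write(3, 5, conditional=False)
    drf.write(3, 9, conditional=True)
    print(drf.read(3, conditional=False), drf.read(3, conditional=True))
    drf.resolve(prediction_correct=True)
    print(drf.read(3, conditional=False))
```

Committing or discarding speculative results only flips per-register 
state bits in this model, so no data is copied between files. This 
corresponds to the "switching a register access path" variant of the 
future file mentioned in Section 2.2. 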

3.3 Advanced conditioning 
3.3.1 Conventional conditional 
branching method: 

Conditional branching generally consists 
of the following processes: 

① Condition generation: the condition 
   on which the conditional branch 
   depends is generated, by executing an 
   arithmetic or logical operation, etc. 

② Conditioning: the condition generated 
   in process ① is tested against the 
   branch condition, and whether or not 
   the branch is taken is decided. 

Processes ① and ② above are performed 
regardless of whether or not the branch 
is taken. When the branch is taken, the 
following processes are additionally 
required: 

③ Branch target address generation: the 
   instruction address of the branch 
   target is calculated according to the 
   addressing mode. This process can be 
   omitted for some addressing modes. 

④ Branching process: the branch target 
   address generated in process ③ is set 
   in the program counter (PC). 

There are two conventional, commonly 
used conditional branching methods: 

(1) Compare-and-branch method: 

As shown in Table 1, all of the 
conditional branching processes ①, ②, 
③ and ④ are performed by one 
compare-and-branch instruction. Because 
its critical path, ① → ② → ④, is long, 
this method has the drawback of a longer 
branch delay. 

(2) Branch-on-condition method: 

Using a condition code (CC: Condition 
Code), process ① and processes ②, ③ 
and ④ are performed by separate 
instructions. As shown in Table 1, 
process ① is performed by the CC 
setting of a normal arithmetic/logical 
operation instruction, and processes ②, 
③ and ④ are performed by one 
branch-on-condition instruction. The 
critical path of the 
branch-on-condition instruction itself, 
② → ④ or ③ → ④, is shorter, so the 
branch delay is shorter than with the 
compare-and-branch instruction. However, 
the introduction of the CC creates new 
problems, such as detecting flow 
dependences through the CC and avoiding 
access conflicts on the CC. 

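Table 1: Conditional branching method 

Method                  Process ①       Process ②       Processes ③, ④ 
Compare-and-branch      compare-and-branch instruction (all of ① to ④) 
Branch-on-condition     arithmetic/     branch-on-condition instruction 
                        logical op.     (② to ④) 
Advanced conditioning   arithmetic/     test            branch 
                        logical op.     instruction     instruction 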
3.3.2 Advanced conditioning 

In addition to the conditional branching 
methods described in the previous 
section, there is advanced conditioning 
[12] [2], which the DSNS processor 
adopts. In this method, as shown in 
Table 1, process ① is performed by the 
CC setting of a normal arithmetic/logical 
operation instruction, as in the 
branch-on-condition method. However, 
process ② and processes ③ and ④ are 
performed by separate instructions. 
Processes ③ and ④ are performed by one 
branch instruction, whose critical path, 
③ → ④, is short. 

With the introduction of advanced 
conditioning, the following two types of 
registers are defined: 

(a) TF register (TFR: True/False 
    Register): there are 32 registers, 
    each 1 bit long, which hold whether 
    or not the branch is taken (True = 
    TAKEN, False = NOT TAKEN). 

(b) Tagged general-purpose/floating 
    point register: each general-purpose/ 
    floating point register is given a 
    4-bit tag. The CC generated by the 
    execution of an arithmetic/logical 
    operation instruction is stored in 
    the tag of its destination register 
    (refer to Fig. 1). This resolves the 
    problems concerning the CC in the 
    conventional branch-on-condition 
    method [2]. 

Further, the following three types of 
related instructions are defined (refer 
to Fig. 1): 

(a) Test instruction: tests whether or 
    not the CC within the source 
    register matches the branch 
    condition, and sets the resulting 
    true/false value (whether or not the 
    branch is taken) in the destination 
    TF register. This corresponds to 
    process ② of conditional branching. 

(b) Compare-and-test instruction: 
    performs an arithmetic or logical 
    comparison between two source 
    operands, tests whether or not the 
    result matches the branch condition, 
    and sets the resulting true/false 
    value in the destination TF 
    register. This corresponds to 
    processes ① and ② of conditional 
    branching. 

(c) Branch instruction: the branch 
    result (TAKEN/NOT TAKEN) is decided 
    according to the true/false value in 
    the source TF register. When the 
    result is 'TAKEN', the branch target 
    address is additionally set in the 
    program counter (PC). The branch 
    target address is calculated 
    according to the addressing mode 
    (PC-relative / GR-relative / PC + 
    GR). This corresponds to processes ③ 
    and ④ of conditional branching. 

In addition to these instructions, an 
instruction that performs a logical 
operation between two TF registers is 
also provided for multiway branching. 
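
The division of work among these instructions can be illustrated with a 
short Python sketch. Everything named here is hypothetical: the 
condition mnemonics, the choice of AND as the TF logical operation, and 
the rule that writes to TFR0 are ignored are assumptions made for 
illustration. Only the 32 one-bit TF registers come from the text above, 
and the always-True TFR0 from Section 3.5. 

```python
import operator

NUM_TFR = 32
# Hypothetical condition names; the real encoding is not given in the text.
CONDITIONS = {"eq": operator.eq, "ne": operator.ne,
              "lt": operator.lt, "ge": operator.ge}


class TFRegisterFile:
    def __init__(self):
        self.bits = [False] * NUM_TFR
        self.bits[0] = True  # TFR0 always reads as True (= TAKEN)

    def set(self, index, value):
        if index != 0:  # assumption: writes to TFR0 are ignored
            self.bits[index] = bool(value)


def compare_and_test(tfr, dest, cond, a, b):
    """Processes 1 and 2: compare two operands and record TAKEN/NOT TAKEN."""
    tfr.set(dest, CONDITIONS[cond](a, b))


def tf_and(tfr, dest, src1, src2):
    """Logical operation between two TF registers (AND chosen as an example)."""
    tfr.set(dest, tfr.bits[src1] and tfr.bits[src2])


def branch(tfr, src, pc, target):
    """Processes 3 and 4: the branch only reads one TF bit and selects the next PC."""
    return target if tfr.bits[src] else pc + 4


if __name__ == "__main__":
    tfr = TFRegisterFile()
    x, y = 1, 2
    # if (x < y and y != 0) goto 0x200, using a single branch instruction
    compare_and_test(tfr, 1, "lt", x, y)
    compare_and_test(tfr, 2, "ne", y, 0)
    tf_and(tfr, 3, 1, 2)
    print(hex(branch(tfr, 3, pc=0x100, target=0x200)))
```

Because the branch itself only reads one TF bit, the two comparisons and 
the AND can be scheduled well before it, and a compound condition needs 
one branch instruction instead of two. 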
Advanced conditioning can be expected to 
provide the following advantages: 

• Scheduling the test instruction or the 
  compare-and-test instruction far ahead 
  of the branch instruction hides the 
  branch delay associated with the 
  conditioning (deciding whether or not 
  the branch is taken). 

• Using the logical operation 
  instruction between TF registers allows 
  multiple branch instructions to be 
  merged into one branch instruction. In 
  other words, it is possible to reduce 
  the number of branch instructions. 

• [...] be described in Section 3.4. As 
  a result, the branch delay itself can 
  be reduced. 

FIG. 1: Advanced conditioning (TF register file) 

FIG. 2: Data path in DSNS processor 

3.4 Early branch resolution 

In addition to the methods described in 
Sections 3.1 through 3.3, it is important 
to speed up the execution of the branch 
instruction itself, in other words, to 
resolve the branch at an early stage, 
for the following reasons: 

• When the branch prediction is 
  incorrect, the instruction pipeline is 
  restored by quashing all instructions 
  placed in the conditional execution 
  mode (pipeline flush). Therefore, the 
  later the control dependence arising 
  from the branch instruction is 
  resolved, the more instructions have 
  to be quashed, and naturally the 
  larger the branch penalty becomes. 

• When a new branch instruction is 
  fetched in the conditional execution 
  mode, the execution of that branch 
  instruction and its successor 
  instruction(s) cannot be started with 
  only a level-1 conditional execution 
  mode, because the levels of the 
  conditional execution mode run short. 
  However, increasing the level of the 
  conditional execution mode means 
  increasing the multiplicity of the 
  multiplexed register file (refer to 
  Section 3.2), which directly increases 
  the amount of hardware. It is 
  therefore difficult to raise the level 
  of the conditional execution mode to 2 
  or higher. 

Against this background, the DSNS 
processor adopts early branch 
resolution. As described in Section 5.2, 
general instructions other than branch 
instructions are executed in the E 
(execution) stage of the instruction 
pipeline, whereas the execution of a 
branch instruction is started in the D 
(decoding) stage, which precedes the E 
stage. 

Executing branch instructions in a 
different stage from general 
instructions raises the problem that 
extra hardware is required. In the DSNS 
processor, however, as described in 
Section 3.3, a branch instruction has 
little to do: only the TFR reference and 
the generation of the branch target 
address. The amount of hardware required 
is therefore modest. The hardware used 
exclusively for branch instructions is 
provided as a branch unit, described in 
Section 5.1. 

3.5 Specification of branch 
instructions 

The specifications of the branch 
instructions in the DSNS processor are 
as follows: 

(1) Type: 

As described in Section 3.1, branch 
instructions are classified into two 
types, the BTB registration type and the 
BTB non-registration type. All branch 
instructions are conditional branch 
instructions. Whether or not the branch 
is taken is decided according to the 
value of the TFR given as the source 
operand. The value of TFR0 is always 
True (= TAKEN), so branch instructions 
that specify TFR0 as the source operand 
act as unconditional branch 
instructions. 
(2) Addressing mode: 

The following three addressing modes are 
provided for generating the branch 
target address: 

① PC (program counter) + offset (signed 
   19 bits) 
② GR (general-purpose register) + offset 
   (signed 14 bits) 
③ PC + GR 

Because the instruction length is 4 
bytes, the low-order 2 bits of an 
instruction address are always '00'. 
Therefore, only the high-order 30 bits 
are generated in the calculation of a 
branch target address. 
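
A possible reading of the three addressing modes is sketched below in 
Python. Several points are assumptions rather than facts from this 
paper: addresses are taken to be 32 bits wide, the offsets are taken to 
count 4-byte instructions (so they are shifted left by 2), and the 
function names are hypothetical. Only the offset widths and the fixed 
'00' low-order bits come from the text. 

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit instruction addresses


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def align(addr):
    # The low-order 2 bits of an instruction address are always '00'.
    return addr & MASK32 & ~0x3


def target_pc_relative(pc, offset19):
    # Assumption: the 19-bit offset is in units of 4-byte instructions.
    return align(pc + (sign_extend(offset19, 19) << 2))


def target_gr_relative(gr, offset14):
    # Assumption: the 14-bit offset is in units of 4-byte instructions.
    return align(gr + (sign_extend(offset14, 14) << 2))


def target_pc_plus_gr(pc, gr):
    return align(pc + gr)


if __name__ == "__main__":
    print(hex(target_pc_relative(0x1000, 3)))
    print(hex(target_gr_relative(0x2000, 0x3FFE)))
    print(hex(target_pc_plus_gr(0x1000, 0x20)))
```

Under these assumptions a 19-bit word offset reaches about ±1 MB around 
the PC, which is one reason a designer might encode offsets in 
instruction units rather than bytes. 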

4. Outline of processor 

In order to verify the validity of the 
DSNS architecture, we have been 
developing a prototype processor, the 
"DSNS processor", based upon the DSNS 
architecture. This chapter outlines the 
configuration of the DSNS processor and 
its instruction pipeline processing [4]. 

4.1 Configuration of DSNS processor: 

The DSNS processor is comprised of the 
following primary units. 

(1 ^ Instruction cache ( TC: Instruction 

Cache); 

The port size is 16 bytes, and an 
instruction block comprised of four 4-byte 
length instructions, is simultaneously 
fetched. This is a virtual address cache 
using a direct mapping method with 64 
bytes (= four instruction blocks) of the 
line size. 

Each instruction block is equipped with a
branch target buffer (BTB: Branch Target
Buffer) entry that holds one predicted
branch target address. This enables the
successive fetching of instruction blocks
regardless of the existence of branch
instructions. The BTB and the instruction
cache share a tag array.
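
The geometry described above (4-byte
instructions, 16-byte blocks, 64-byte
lines, direct mapping) implies a simple
split of the fetch address, sketched below
in Python. The number of cache lines is
not given in the paper, so NUM_LINES = 256
is purely an assumed value, and the
function name is hypothetical.

```python
BLOCK_BYTES = 16   # one instruction block = four 4-byte instructions
LINE_BYTES = 64    # one cache line = four instruction blocks
NUM_LINES = 256    # assumption: the cache size is not stated in the paper


def split_fetch_address(addr, num_lines=NUM_LINES):
    """Split a byte address into the fields of a direct-mapped lookup."""
    slot = (addr % BLOCK_BYTES) // 4             # instruction within the block
    block = (addr % LINE_BYTES) // BLOCK_BYTES   # block within the line
    line_no = addr // LINE_BYTES
    index = line_no % num_lines                  # selects the cache line
    tag = line_no // num_lines                   # compared with the shared tag
    return {"tag": tag, "index": index, "block": block, "slot": slot}
```

Because the BTB shares the tag array and
has one entry per instruction block, the
pair (index, block) would also select the
BTB entry in this model.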

(2) PreDecoder (PD: PreDecoder):

The four fetched instructions are
pre-decoded, and the information necessary
for a conflict check is obtained. Further,
all information concerning the execution of
branch instructions is also generated here.

(3) Decoder (D: Decoder):

Information necessary for the execution 
control of each instruction, especially, 
control information of the functional units 
is read out from a ROM. 

(4) Conflict checker (CC: Conflict
Checker):

Based upon the results of pre-decoding,
the following are performed for up to four
instructions:

• Detection of flow dependences and
output dependences between
instructions within an instruction
block;

• Detection of flow dependences and
output dependences relative to a
successor instruction block;

• Detection and adjustment of
functional unit competition; and

• Allocation of read-out ports in the
register file.
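
The first item of the list above, the
detection of flow and output dependences
within one instruction block, can be
sketched as follows. This is only an
illustrative software model with a
hypothetical instruction representation
(one destination register, a tuple of
source registers); it says nothing about
how the hardware checker is built.

```python
from collections import namedtuple

# Hypothetical representation: one destination (or None) and source registers.
Instr = namedtuple("Instr", ["dest", "srcs"])


def intra_block_conflicts(block):
    """List (kind, earlier_index, later_index, register) for one block."""
    conflicts = []
    for j, later in enumerate(block):
        for i in range(j):
            earlier = block[i]
            if earlier.dest is None:
                continue
            if earlier.dest in later.srcs:
                # The later instruction reads what the earlier one writes.
                conflicts.append(("flow", i, j, earlier.dest))
            if later.dest == earlier.dest:
                # Both instructions write the same register.
                conflicts.append(("output", i, j, earlier.dest))
    return conflicts
```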



(5) Branch unit (BU: Branch Unit):

This is a unit used exclusively for the
execution of branch instructions for the
purpose of early branch resolution, and it
is constructed as a three-stage pipeline.
Further, it is equipped with a conflict
checker used exclusively for branch
instructions (BCC) and a four-stage branch
instruction buffer (BIB: Branch Instruction
Buffer) for the purpose of executing branch
instructions successively. The details are
described in Chapter 5.

(6) Functional units (FUs: Functional
Units):

As shown in Fig. 2, the processor is
equipped with the following four systems,
a total of 13 functional units. Up to four
instructions can be issued per cycle to any
combination of functional units.

① Integer functional units: ALU (×2),
shifter, multiplier

② Floating-point functional units: ALU,
multiplier, divider, type converter

③ Load/store functional units: two
independent load/store units [5]

④ Branch functional units: re-fetch
address generator, two units for the
advanced conditioning

(7) Dual register files (DRFs: Dual
Register Files):

As shown in Fig. 2, the processor is
equipped with three systems of register
files: ① general-purpose register (GR),
② floating-point register (FPR) and
③ TF register (TFR).

(8) Dual-port data cache (DPDC:
Dual-Port Data Cache) [5]:

In order to support a dual load/store
pipeline in which both integer data and
floating-point data can be processed, a
dual-port configuration is realized. The
port size is 8 bytes, and it is a
direct-mapped virtual address cache.

Further, it is a non-blocking cache that
can accept successor load/store
instructions even if a cache miss occurs.

4.2 Instruction pipeline processes:

The instruction pipeline has a 4-stage
construction as follows:

① IF: Instruction block fetch +
pre-decode;

② D: Conflict check + decode +
operand fetch;

③ E: Execution; and

④ W: Writing.

The target pipeline cycle time is 60 ns.
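
As a back-of-the-envelope check of what
these figures imply, the short Python
sketch below derives the peak issue rate
and the unstalled pipeline latency from
the 60 ns target cycle time, the 4-stage
pipeline, and the issue width of four
instructions per cycle. These are upper
bounds computed from the stated targets,
not measured results, and the function
names are hypothetical.

```python
CYCLE_TIME_NS = 60   # target pipeline cycle time
ISSUE_WIDTH = 4      # up to four instructions issued per cycle
PIPELINE_STAGES = 4  # IF, D, E, W


def peak_issue_rate_mips(cycle_time_ns=CYCLE_TIME_NS, width=ISSUE_WIDTH):
    """Peak issue rate in millions of instructions per second."""
    return width * 1000.0 / cycle_time_ns


def unstalled_latency_ns(cycle_time_ns=CYCLE_TIME_NS, stages=PIPELINE_STAGES):
    """Time for one instruction to pass through all stages without stalls."""
    return stages * cycle_time_ns
```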

Fig. 3 shows the instruction pipeline
processes per instruction type. The
processing of the branch instructions is
described in Chapter 5.

FIG. 3: Pipeline processes

5. Branch pipeline 

The processing of the branch instructions 
is performed by the branch pipeline after 
the branch prediction in the IF stage and 
the setting of the conditional execution 
mode. The execution entity of the branch 
pipeline is a branch unit, comprised of 
three stages, BD, BE and BW. 

5.1 Branch unit 

The branch unit (BU: Branch Unit) is
equipped with the following two units and
a control mechanism for the branch
pipeline.

(1) Branch instruction buffer (BIB:
Branch Instruction Buffer)

Branch instructions basically have to be
executed sequentially, so the branch unit
processes no more than one branch at a
given time. Therefore, to handle the case
where multiple branch instructions exist in
the same instruction block, the branch unit
is provided with a branch instruction
buffer (BIB) that holds four instructions.
In the branch unit, an instruction block
that includes branch instructions is first
buffered in the BIB, and the branch
instructions are then executed one by one
in order.
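
The buffering behaviour can be pictured
with the following minimal Python sketch.
The class and method names are
hypothetical, and the rule that a new
block may be loaded only after the
previous one has been drained is an
assumption for the sake of a simple model.

```python
from collections import deque


class BranchInstructionBuffer:
    """Toy model of the BIB: holds one block, releases branches in order."""

    CAPACITY = 4  # the BIB holds four instructions (one instruction block)

    def __init__(self):
        self._pending = deque()

    def load_block(self, block):
        if len(block) > self.CAPACITY:
            raise ValueError("an instruction block has at most four entries")
        if self._pending:
            # Assumption: the previous block must be drained first.
            raise RuntimeError("previous block still has unexecuted branches")
        # Keep only the branch instructions, preserving program order.
        self._pending.extend(ins for ins in block if ins["is_branch"])

    def next_branch(self):
        """Return the next branch to execute, or None when the BIB is empty."""
        return self._pending.popleft() if self._pending else None
```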

(2) Conflict checker for branch
instructions (BCC: Branch Conflict
Checker)

When the D stage of the main pipeline and
the BD stage are synchronized, the branch
instruction being processed in the branch
unit also exists in the D stage of the main
pipeline. Therefore, the CC in the main
pipeline is used for the detection of data
dependence. However, because the branch
unit and the main pipeline operate
independently of each other, when the BD
stage is interlocked due to a flow
dependence, the instruction block to which
the branch instruction belongs may advance
to the E stage or beyond, and a successor
instruction block may exist in the D stage.
In other words, the branch instruction no
longer exists in the D stage, which is for
general instructions, so the CC cannot be
used. Consequently, the branch unit is
equipped with a conflict checker used
exclusively for branch instructions (BCC:
Branch Conflict Checker), which detects
data dependence when the CC cannot be
used as mentioned above. The BCC detects
data dependence relative to preceding
instruction blocks, similar to the CC for
general instructions. However, it does not
detect data dependence between
instructions within an instruction block.

Further, since the functional unit
(re-fetch address generator) and a read-out
port of the register file are used
exclusively by the branch unit, it is
unnecessary to perform:

• Detection and adjustment of
functional unit competition, and

• Allocation of read-out ports in the
register file.

5.2 Branch instruction processes

5.2.1 IF stage

The IF stage is shared with general
(non-branch) instructions. Concerning
branching, the following processing is
performed, and the results are transmitted
to the D stage of the main pipeline and the
BD stage of the branch pipeline.

(1) Branch prediction

Four instructions (an instruction block)
are pre-fetched from the instruction cache
according to the PFC (pre-fetch counter)
value; simultaneously, the corresponding
BTB entry is read out. If the entry is
valid, a branch instruction that exists
within the instruction block is predicted
as 'taken'. In this case, the pre-fetch for
the next cycle is performed using the
branch target address registered in the BTB
entry.

(2) NOP replacement

When the branch prediction is 'taken',
all instructions in the instruction block
that follow the branch instruction to which
the prediction corresponds are treated as
NOPs.
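
Steps (1) and (2) can be combined into a
small Python sketch of the IF stage. Two
points are assumptions rather than facts
from the paper: the BTB entry is modelled
with a hypothetical "slot" field that
identifies which instruction in the block
the prediction refers to, and an invalid
entry is taken to mean sequential fetch of
the next 16-byte block.

```python
NOP = "nop"
BLOCK_BYTES = 16  # four 4-byte instructions per instruction block


def if_stage(pfc, block, btb_entry):
    """Return (instructions passed on, next PFC value, predicted taken?).

    btb_entry is None when the entry is invalid; otherwise it is a dict with
    a 'target' address and a hypothetical 'slot' index of the predicted branch.
    """
    if btb_entry is None:
        # Assumption: with no valid entry, fetch continues sequentially.
        return list(block), pfc + BLOCK_BYTES, False
    slot = btb_entry["slot"]
    # Successors of the predicted branch within the block are treated as NOPs.
    passed = list(block[:slot + 1]) + [NOP] * (len(block) - slot - 1)
    return passed, btb_entry["target"], True
```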



(3) Detection of control dependence and
interlock

As a result of pre-decoding, if the
existence of a branch instruction or a
boost instruction is ascertained, the
conditional execution mode is set and the
interlock control is performed as follows:

① When a branch instruction that has not
yet completed execution exists in the
instruction pipeline:

(i) When a branch instruction exists
within the instruction block: the IF
stage is interlocked until the
execution of the branch instruction
in the branch pipeline is completed.
When the execution is completed,
the processing of ② is performed.

(ii) When no branch instruction exists
within the instruction block: all
instructions are placed in the level
1 conditional execution mode, and
the processing progresses to the D
stage.

② When no branch instruction that has
not yet completed execution exists in the
instruction pipeline:

(i) When a branch instruction exists
within the instruction block: all
instructions that are successors of
the branch instruction are placed in
the conditional execution mode.
When multiple branch instructions
exist in the same instruction block,
the level of the conditional
execution mode is increased
according to the number of branch
instructions. In other words, the
1st branch instruction and its
successor instructions are set to
level 1, and the 2nd branch
instruction and its successor
instructions are set to level 2.
Instructions in the level 2 or
higher conditional execution mode
are not interlocked in the IF stage;
however, their issue is blocked in
the D stage.

(ii) When no branch instruction exists
[within the instruction block]: all
instructions within the instruction
block are placed in the
unconditional execution mode.

For boost instructions, the level of the
conditional execution mode set in cases ①
and ② above is additionally increased by 1.
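
The case analysis above can be summarized
by the following Python sketch, in which
level 0 stands for the unconditional
execution mode. The function names and the
instruction representation are
hypothetical. The sketch follows the
literal wording that the 1st branch and
its successors are at level 1, and it
assumes that instructions preceding the
first branch in a block stay
unconditional.

```python
def assign_levels(block, unresolved_branch_in_pipeline):
    """Return ('interlock', None) or ('proceed', [level per instruction]).

    Each instruction is a dict with boolean 'is_branch' and 'is_boost' keys.
    """
    has_branch = any(ins["is_branch"] for ins in block)
    if unresolved_branch_in_pipeline:
        if has_branch:
            return "interlock", None      # IF waits for the pending branch
        base = [1] * len(block)           # everything runs at level 1
    else:
        base, level = [], 0
        for ins in block:
            if ins["is_branch"]:
                level += 1                # each branch opens a new level
            base.append(level)
    # Boost instructions get one additional level.
    levels = [lv + 1 if ins["is_boost"] else lv for lv, ins in zip(base, block)]
    return "proceed", levels


def can_issue_in_d_stage(level):
    """Level 2 or higher instructions are blocked from issue in the D stage."""
    return level < 2
```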

5.2.2 Branch pipeline processes 

Processing in a branch pipeline is 
controlled by the following factors: 

(a) Branch prediction (taken/not-taken)

(b) Addressing mode (GR access is
necessary/unnecessary)

(c) Type of branch instruction (BTB
registration type/non-registration
type)

(d) Branch result (TAKEN/NOT-TAKEN)

Factors (a), (b) and (c) above are
ascertained in the IF stage, and factor (d)
is ascertained in the BD stage.

Fig. 4 shows the outline of processing of 
the branch pipeline. 

(1) Resolution of control dependence

As a result of the detection of data
dependence, when there is no flow
dependence on the TFR that is a source
operand of the branch instruction, the TFR
is read out. The TF value read out is the
branch result. By comparing this value
with the prediction result, the correctness
of the branch prediction is determined.

(a) When the branch prediction is correct
(the cases ®, ©, ©, © and ® in Fig.
4): the conditional execution mode of
all instructions is reduced by 1 level.
In other words, instructions that are
executed in the level 1 conditional
execution mode are placed in the
unconditional execution mode, and
instructions that are executed in the
level 2 conditional execution mode are
placed in the level 1 conditional
execution mode. Further, concerning
the dual register file, as described in
Section 3.2, a valid alternate register is
switched to a current register. In the
cases ®, ©, © and ® in Fig. 4, the
interlock of the stage is released in the
BD stage, and in the case ©, it is
released in the BE stage.

(b) When the branch prediction is
incorrect (the cases ®, ®, ©, ® and
®): all instructions that are executed
in the conditional execution mode are
quashed. Further, the alternate
registers are also quashed.

FIG. 4: Branch pipeline processes
[Figure: cases classified by (a) branch
prediction, (b) addressing mode, (c) type
and (d) branch results. Legend entries:
BTB registration; conflict checker (CC);
re-fetch address generation (RAG);
general-purpose register file; instruction
re-fetch (RIF: Re-Instruction Fetch);
branch target address (TA).]
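
The resolution rule in (a) and (b) can be
sketched in Python as follows. This is a
simplified model under stated assumptions:
in-flight instructions are represented
only by their conditional execution level
(0 = unconditional), a quashed instruction
is represented by None, and the handling
of the dual register file is omitted. The
function name is hypothetical.

```python
def resolve_branch(predicted_taken, tf_value, levels):
    """Compare the TF value with the prediction and update in-flight levels.

    Returns (prediction_correct, new_levels).
    """
    correct = (predicted_taken == tf_value)
    if correct:
        # Every conditional level drops by one; level 1 becomes unconditional.
        return True, [max(lv - 1, 0) for lv in levels]
    # Misprediction: everything in conditional execution mode is quashed.
    return False, [lv if lv == 0 else None for lv in levels]
```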

(2) Re-fetch address generation

A re-fetch address, which becomes
necessary when a branch prediction is
incorrect, is generated. The re-fetch
address is calculated using the RAG
(Re-fetch Address Generator), which is one
of the functional units. The generation of
the re-fetch address is governed by the
prediction result and the addressing mode,
as follows:

(a) When the branch prediction is
'not-taken': the branch target address is
generated as the re-fetch address.

(i) When the addressing mode is
PC-relative (the cases ® through
© in Fig. 4): the address
calculation is performed in the BD
stage.

(ii) When the addressing mode is
GR-relative or PC + GR (the
cases © through ®): the access
to the GR is performed in the BD
stage, and the address calculation
is performed in the BE stage.

(b) When the branch prediction is 'taken'
(the cases ® and ®): the non-branch
target address is generated as the
re-fetch address. Since access to the
GR is unnecessary, the address
calculation is always performed in the
BD stage.

Further, for branch instructions of the BTB
registration type that are not yet
registered in the BTB (in other words,
instructions that are predicted as
'not-taken'), the branch target address is
registered in the BTB regardless of the
branch result (TAKEN or NOT-TAKEN) (the
cases ®, ©, © and ® in Fig. 4).
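
The selection of the re-fetch address and
of the stage in which it becomes available
can be sketched as below. The function
names and mode strings are hypothetical,
and the "non-branch target address" is
assumed here to be the address of the
instruction following the branch
(branch PC + 4), which the paper does not
spell out.

```python
def refetch_plan(predicted_taken, mode, branch_pc, target):
    """Return (re-fetch address, stage in which it is calculated)."""
    if predicted_taken:
        # Assumption: the non-branch target is the next sequential instruction.
        return branch_pc + 4, "BD"
    if mode == "PC+offset":
        return target, "BD"    # no GR access needed
    if mode in ("GR+offset", "PC+GR"):
        return target, "BE"    # GR is read in BD, address calculated in BE
    raise ValueError("unknown addressing mode: " + mode)


def should_register_in_btb(is_registration_type, already_registered):
    """Registration-type branches not yet in the BTB are registered,
    regardless of the branch result."""
    return is_registration_type and not already_registered
```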

(3) Pipeline restoration processing

When the branch prediction is incorrect,
the instructions fetched because of the
wrong prediction and their execution
results are quashed (pipeline flush);
simultaneously, the correct instruction
flow is re-fetched (RIF: Re-Instruction
Fetch) using the re-fetch address (the
cases ®, ®, ©, ® and ®).

5.3 Branch penalties 

Assuming that there is no flow dependence
on the source operand of a branch
instruction, the pipeline disruption caused
by the branch instruction (the branch
penalty) is as follows.

(a) When the branch prediction is correct
(the cases ®, ®, ©, © and ® in Fig.
4): there is no penalty.

(b) When [the addressing mode is]
PC-relative and the branch prediction is
incorrect, or when 'taken' was predicted
and the branch prediction is incorrect (the
cases ®, ® and ® in Fig. 4): there is a
penalty of one cycle.

(c) When [the addressing mode is]
GR-relative or PC + GR and 'not-taken'
was predicted but the branch prediction is
incorrect (the cases © and © in Fig. 4):
there is a penalty of two cycles.
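
The three cases above reduce to a short
decision function, sketched here in
Python. The function name and the mode
strings are hypothetical; the cycle counts
are those listed in (a) through (c) and
hold only under the stated assumption of
no flow dependence on the branch's source
operand.

```python
def branch_penalty_cycles(predicted_taken, actual_taken, mode):
    """Penalty in cycles for one branch, per cases (a) to (c)."""
    if predicted_taken == actual_taken:
        return 0                  # (a) correct prediction
    if predicted_taken or mode == "PC+offset":
        return 1                  # (b) re-fetch address is ready in BD
    return 2                      # (c) GR-relative or PC + GR, ready in BE
```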

Concerning control dependence, if there is
no flow dependence on the TFR, everything
is resolved in the BD stage.

6. Conclusion



This paper has described the branch
architecture and the branch pipeline
processing of the DSNS processor.

In order to reduce the influence of branch
penalties, the following techniques have
been adopted:

① Static branch prediction with a
branch-target-buffer;
② Speculative instruction execution;
③ Advanced conditioning; and
④ Early branch resolution.

At present, hardware development is in
progress. At the same time, a software
simulator is being developed. The four
techniques adopted above are closely
related to each other, so we are planning
to evaluate the interactions among them in
the future.

Further, this branch architecture is
premised on the existence of an
optimizing compiler. The static branch
prediction, the boosting and the advanced
conditioning cannot be expected to be
effective without the optimizing compiler.
In particular, in order to make the
boosting and the advanced conditioning
effective, the development of a
sophisticated static code scheduling
algorithm is essential. We would like to
report on these optimizing compiler
technologies on a separate occasion.